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Abstract 

Complex computer codes are often too time expensive to be directly used to 
perform uncertainty propagation studies, global sensitivity analysis or to solve op- 
timization problems. A well known and widely used method to circumvent this 
inconvenience consists in replacing the complex computer code by a reduced model, 
called a metamodel, or a response surface that represents the computer code and 
requires acceptable calculation time. One particular class of metamodels is stud- 
ied: the Gaussian process model that is characterized by its mean and covariance 
functions. A specific estimation procedure is developed to adjust a Gaussian pro- 
cess model in complex cases (non linear relations, highly dispersed or discontinuous 
output, high dimensional input, inadequate sampling designs, etc.). The efficiency of 
this algorithm is compared to the efficiency of other existing algorithms on an analyt- 
ical test case. The proposed methodology is also illustrated for the case of a complex 
hydrogeological computer code, simulating radionuclide transport in groundwater. 
Keywords: Gaussian process, kriging, response surface, uncertainty, covariance, 
variable selection, computer codes. 
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1 INTRODUCTION 



With the advent of computing technology and numerical methods, investigation of 
computer code experiments remains an important challenge. Complex computer 
models calculate several output values (scalars or functions) which can depend on 
a high number of input parameters and physical variables. These computer models 
are used to make simulations as well as predictions or sensitivity studies. Impor- 
tance measures of each uncertain input variable on the response variability provide 
guidance to a better understan ding of the m o deling i n order t o redu ce the response 



Helton et al 



uncer tainties most effectively (partem et ajj (|2000l ). IKleiinenl (|1997l ) 
([20od)). 

However, complex computer codes are often too time expensive to be directly 
used to conduct uncertainty propagation studies or global sensitivity analysis based 
on Monte Carlo methods. To avoid the problem of huge calculation time, it can be 
useful to replace the complex computer code by a mathematical approximation, called 
a respon se surface or a su r rogat e model or also a metamodel. The response surface 
method (IBox and Draper! (119871 )) consists in constructing a function that simulates 
the behavior of real phenomena in the variation range of the influential parame- 
ters, starting from a certain number of experiments. Similarly to this theory, some 



methods have been developed to bu i 



Sacks et al 



d su rr ogates for long running computer codes 



(jl989l ) , lOsio and AmonI (jl996l ) , Kleijnen and Sargent] ( 



20f)(]h . lFang et al 



2006)). Several metamodels are classically used: polynomials, splines, generalized 



linear models, or learning s t atisti c al models such as neural networks, support vector 



machines, . . . (IHastie et al 



(2002), 



Fang et al 



|2006j)). 



For sensitivity analysis and uncertainty propagation, it would be useful to obtain 
an analytic predictor formula for a metamodel. Indeed, an analytical formula often 
allows the direct calculation of sensitivity indices or output uncertainties. Moreover, 
engineers and physicists prefer interpretable models that give some understanding 
of the simulated physical phenomena and parameter inter a ctions . Some me t amod - 
els, such a s poly nomials (jJourdan and Zabalza-Mezghanil (|2004l ). IKleiinenl (|2005l ). 



Iooss et al 



(|2006l )). are ea sily interpre t able b u t not alway s very efficient. Others, for 



instance neural networks (Alametal 



|200J), iFang et al.l (|2006r 0. are more efficient 



but do not provide an analytic predictor formula. 
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The kriging method (jMatheronl (|1970l ). ICressid (|1993l )) has been developed for 
spatial interpolatio n problems; i t takes into account spatial statistical structure of the 



estimated variable. 



Sacks et al. 



(|1989l ) have extended the kriging principles to com- 
puter experiments by considering the correlation between two responses of a computer 
code depending on the distance between input variables. The kriging model (also 
called Gaussian process model), characterized by its mean and covariance functions, 
presents several advantages, especially the interp olation and interpre t ability proper- 



Currin et al 



f)199lh . 



Santner et al 



ties. M oreo ver, numerous a uthor s (for example, 
(2003) and IVazquez et al.l (|2005l )) show that this model can provide a statistical 
framework to compute an efficient predictor of code response. 

From a practical standpoint, constructing a Gaussian process model implies es- 
timation of several hyperparameters included in the covariance function. This opti- 



mization problem is pa rticularly difficult f or a mode 



quate sampling designs (jFang et al 



with many inputs and inade- 



(|2006l ). IO'Haganl (|2006l )). In this paper, a special 
estimation procedure is developed to fit a Gaussian process model in complex cases 
(non linear relations, highly dispersed output, high dimensional input, inadequate 
sampling designs). Our purpose includes developing a procedure for parameter esti- 
mation via an essential step of input parameter selection. Note that we do not deal 
with the design of experiments in computer code simulations (i.e. choosing values 
of input parameters). I ndeed, we work on d ata obtained in a previous study (the 
hydrogeological model of IVolkova et al.1 (120081 )) and try to adapt a Gaussian process 
model as well as possible to a non-optimal sampling design. In summary, this study 
presents two main objectives: developing a methodology to implement and adapt a 
Gaussian process model to complex data while studying its prediction capabilities. 

The next section briefly explains the Gaussian process modeling from theoretical 
expression to predictor formulation and model parameterization. In section 3, a 
parameter estimation procedure is introduced from the numerical standpoint and 
a global methodology of Gaussian process modeling implementation is presented. 
Section 4 is devoted to applications. First, the algorithm efficiency is compared to 
other algorithms for the example of an analytical test case. Secondly, the algorithm 
is applied to the data set (20 inputs and 20 outputs) coming from a hydrogeological 
transport model based on waterflow and diffusion dispersion equations. The last 
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section provides some possible extensions and concluding remarks. 



2 GAUSSIAN PROCESS MODELING 

2.1 Theoretical model 

Let us consider n realizations of a computer code. Each realization y(x) of the com- 
puter code output corresponds to a d-dimensional input vector x = (x±, xj). The 
n points corresponding to the code runs are called an experimental design and are 
denoted as X s = (x^\ ...,x^). The outputs will be denoted as Y s = (y^\ ...,?/ n )) 
with yW = y(x^),i = 1, ...,n. Gaussian process (Gp) modeling treats the determin- 
istic response y(x) as a realization of a random function Y(x), including a regression 
part and a centered stochastic process. This model can be written as: 

Y(x) = f{x) + Z{x). (1) 

The deterministic function f(x) provides the mean approximation of the computer 
code. Our study is limited to the parametric case where the function / is a linear 
combination of elementary functions. Under this assumption, f(x) can be written as 
follows: 

k 
j=0 

where (3 = [Pq, . . . , is the regression parameter vector and F{x) = [fo(x), . . . , fk(x)] 
is the regression matrix, with each fj (j = 0, . . . , k) an elementary function. In the 
case of the one-degree polynomial regression, (d + 1) elementary functions are used: 

' fo(x) = 1, 

< 

k fi(x) = Xi for i = l,...,d. 

In the following, we use this one-degree polynomial for the regression part, while 
our methodology can be extended to other bases of regression functions. The regres- 
sion part allows the addition of an external drift. Without prior information on the 
relation between the model output and the input variables, this quite simple choice 
appears reasonable. Indeed, adding this simple external drift allows for a nonstation- 
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ary global model even if the stochastic part Z is a stationary process. Moreover, on 
our tests of section HI this simple mod el does not a ffect our prediction performance. 

(1989). 



This simplification is also reported by 



Sacks et al. 



The stochastic part Z{x) is a Gaussian centered process fully characterized by its 
covariance function: Cov(Z(a;), Z(u)) = a 2 R(x, u), where a 2 denotes the variance of 
Z and R is the correlation function that provides interpolation and spatial correlation 
properties. To simplify, a stationary process Z(x) is considered, which means that 
correlation between Z{x) and Z(u) is a function of the difference between x and 
u. Our study is focused on a particular family of correlation functions that can be 
written as a product of one-dimensional correlation functions: 



Cov(Z(x), Z(u)) = a 2 R(x - u) = a 2 ]J R t (xi 



ui) 



1=1 



Abrahamsenl (| 19941 ), 



Sacks et al 



dl989j) , IChiles and Delfinerl (119991 1 and 
(|2006l ) give lists of correlation functions with their advantages and drawbacks. Among 
all these functions, we choose to use the generalized exponential correlation function: 



Rasmussen and Williams 



R, 



e,p( x 



u 



Y[exp(-9i\xi - ui\ Pl ) with 0, > and < pj < 2, 



1=1 



where 9 = [9\, . . . , and p = [pi, • • ■ ,Pd]* are the correlation parameters. Our mo- 
tivations stand on the derivation and regularity properties of this function. Moreover, 
different choices of covariance parameters allow a wide spectrum of possible shapes 
(Figure [T]); p = 1 gives the exponential correlation function and p = 2 the Gaussian 
correlation function. 

Even for deterministic computational codes (i.e. outputs corresponding to the 
same inputs are identical), the outputs may be subject to noise (e.g. numerical 
noise). In this case, an independent white noise U(x) is added in the stochastic part 
of the model: 

Y{x)= f(x) + Z{x) + U{x), (2) 

where U(x) is a centered Gaussian variable with variance e 2 = a 2 r. In terms of 
covariance function, this white noise introduces a discontinuity at the origin called 
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Exponential correlation function (1 D) Generalized exponential correlation function (1 D) 

for different correlation parameter (power parameter p=1 ) f ° r different power parameter (8=1) 

1 s 



9 


= 0.5 


e 


= 1 


9 


= 3 




power parameter = 0.5 

power parameter = 1 

power parameter = 2 




Figure 1: Generalized exponential correlation function for different power and correlation 
parameters. 



the nugget effect (jMatheronl (|1970l) ): 



Gov(y(aj), Y(u)) = a 2 (Rq p {x - u) + t5(x - u)) 

where S(v) - 

2.2 Joint and conditional distributions 



1 if v = 0, 
otherwise. 



Under the hypothesis of a Gp model, the learning sample Y s follows the multivariate 
normal distribution 

p(Y s \X s , (3, a, 6, p, r) = M (F s (3, E s ) , 

where F s = [F(x^Y, . . .^{x^) 1 } 1 is the regression matrix and 

S s = a 2 (R e p (x« - . , =1 n + Tin 

is the covariance matrix with I n the n-dimensional identity matrix. 
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If a new point x* = (x*, ■-,Xj) is considered, the joint probability distribution of 
(Y s ,Y(x*)) is : 



p(Y s ,Y(x*)\X s ,x*,(3,a,6,p,T) = M 



F, 
F(x*) 



P, 



V s k(x*) 
k(x*Y a 2 {l + r) 



(3) 



with 



k(x*) = (Cov(y( 1 ),y(s*)),...,Cov(y("),y(a;*)))* 

= a 2 (R (xW,x*) + t5(xW ,x*), . . . , R e ^(x( n \x*) +t5{x^\x 



(4) 



By conditioning this joint distribution on the learning sa mple, we can read ily 
obtain the conditional distribution of Y(x*) which is Gaussian (Ivon Misesl ()1964l )): 



p(Y(x*)\Y s ,X s ,x*,/3,a,0,p,T) 

= M (E[Y (x * ) | Y s , X s , x * , (3, a, 6 , p, r] , Var [Y (x * ) | Y s , X s , x* , (3, a, 6 , p, r] ) , 



(5) 



with 



E[Y{x*)\Y s ,X s ,x*,f3,a,0,p,T]=F{x*)(3 + k{x*) t '£j l (Y s -F s (3), 
Vax[Y(x*)\Y s ,X s , x*,(3, a, 0,p, r] = a 2 (l + r) - k{x*f^- x k{x*). 



(6) 



The conditional mean (equation ([6])) is used as a predictor. The variance formula 
corresponds to the mean squared error (MSE) of this predictor and is also known as 
the kriging variance. This analytical formula for MSE gives a local indicator of the 
prediction accuracy. More generally, Gp model provides an analytical formula for 
the distribution of the output variable at an arbitrary new point. This distribution 
formula can be used for sen sitivity and uncertainty analysis, as well as for quantile 
evaluation (jO'Haganl (|2006l )). Its use can be completely or partly analytical and 
avoids costly methods based for example on a Monte Carlo algorithm. The variance 
expres sion can also be used in sampling strategies (jScheidt and Zabalza-Mezghani 



(J200J)). All these con siderations and poss i ble extensions of Gp mode 



significant advantages (ICurrin et al.l (Il99ll ). iRasmussen and Williams! (120061 )) 



ing r epresent 
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2.3 Parameter estimation 

To compute the mean and variance of a Gp model, estimation of several parameters 
is needed. Indeed, the Gp model ((2]) is characterized by the regression parameter 
vector (3, the correlation parameters (0,p) and the variance parameters (ct 2 ,t). The 
maximum likelihood method is commonly used to estimate these parameters. Given 
a Gp model, the log-likelihood of Y s can be written as: 

l Ys ((3, 0,p, a, t) = ~ ln(2vr) - | ln(<r 2 ) - i ln(det(^ p + r/ n )) 

{Ys ~ Fsf3)t( - R 0,p + Tl n)~ l (Ys ~ F s (5). 

Given the correlation parameters (0,p) and the variance parameter r, the maximum 
likelihood estimator of (3 is the generalized least squares estimator: 



= (F s t (RQp + rlr^Fs)- 1 F s \R 6p + Tl n )~ l Y s , 



(7) 



and the maximum likelihood estimator of a 2 is: 



a 2 = -{Y s - F s p)\R ep + Tl n )-\Y S - F a 0). 



n 



(8) 



Remark 2.1 If we consider the predictor built on the conditional mean (equation tfty), 
we replace (3 by its estimator (3. The predictor writes now 



Y ( x *)\Y e ,x s ,x*,*,G, P> T = F ( X *)P + k(x*y^7 L (Y s - F s (3) 



and its MSE has consequently an additional component ItSantner et all $2003 )): 



Var[Y(x*)\Y s ,X s ,x*,a,e,p,T] = ^(l+r)-fc( a; *) t 5]7 i fc(a:*)+u(a;*)(F s 'S7 1 F s )- 1 'u(a ; * vi 



with u(x*) = F(x*) - k(x*fEj l F s . 

Matrix Rq p depends on and p. Consequently, j3 and a 2 depend on 0, p and r. 
Substituting (3 and a 2 into the log-likelihood, we obtain the optimal choice (0,p,r) 
which maximizes: 



1 



<p(0,p,r) = -- nln(a 2 ) + Id.(\Rq p + rl n |) where \RQ^ p +rI n \ = det(RQ p +Tl n 
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Thus, estimation of (9,p) and r consists in numerical optimization of the function tp 
defined as follows: 



(9, p, t ) = arg min ip(9, p, r) with ip{9, p, r) 

9,p,r 



\ R 9,p + Tl n\ n 



Our study is focused on complex cases with large dimensions d for the input vector 
x (d = 20 in our second example in section U]), where the sampling design has not 
been chosen as a uniform grid. In this setting, minimizing function ip{9, p,r) is an 
optimization problem that is numerically costly and hard to solve. Several difficulties 
guide the choice of the algorithm. First, a large number of parameters imposes the 
use of a sequential algorithm, where different parameters are introduced step by step. 
Second, a large parameter domain due to the number of parameters and the lack of 
prior bounds requires an exploratory algorithm able to explore the domain in an 
optimal way. Finally, the observed irregularities of ?/>(#, p,r) due, for instance, to a 
conditioning problem induce local minima, which recommend the use of a stochastic 
algorithm rather than a descent algorithm. 

imi) 



Welch et al. 



Several algorithms have been proposed in previous papers, 
use the simplex search method and introduce a kind of forward selection algorithm 
in which correlation parameters are added step b y step to r e duce f unction ip(9,p,r). 
In Kennedy and O'Hagan's GEM-SA software (jO'Haganl (120061 )). which uses the 
Bayesian formalism, the posterior distribution of hyperparameters is maximized via 
a conjugate gradient method (the Powel meth o d is u sed as the numerical recipe). The 



DACE Matlab free toolbox (jLophaven et al.l (J2002J)) i ntroduces a powerfu 



tic algorithm based on the Hooke & Jeeves method (jBazaraa et al.1 (|1993l )). which 



stochas- 



unfortunately requires a starting point and some bounds to constrain the optimiza- 
tion. In complex applications, Welch's algorithm reveals some limitations and for 
high dimensional input, GEM-SA and DACE software cannot be applied directly on 
data including all the input variables. To solve this problem, we propose a sequential 
version (inspired by Welch's algorithm) of the DACE algorithm. It is based on the 
step by step inclusion of input variables (previously sorted) . Our methodology allows 
progressive parameter estimation by input variables selection both in the regression 
part and in the covariance function. The complete description of this methodology 
is the subject of the next section. 
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Remark 2.2 One of the problems we have to acknowledge in the evaluation ofip(0,p, r) 
is the condition number of the prior covariance matrix. This condition number affects 
the numerical stability of the linear system for the (3 determination and for the evalu- 
ation of the determinant. The degree of ill- conditioning not only depends on sampling 



design but is also sensitive to the underlying covariance model. \Ababou et all il994 ) 
showed, for example, that a Gaussian covariance (p = 2) implies an ill-conditioned 
covariance matrix (which leads to a numerically unstable system), while an exponen- 
tial covariance (p = 1) gives more stability. Moreover, in our case, the experimental 
design cannot be chosen and numerical parameter estimation is often damaged by 
the ill- conditioning problem. The nugget effect represented by r solves this problem. 
Although the outputs of the learning sample are no longer interpolated, this nugget 
effect improves the correlation matrix condition number and increases robustness of 
our estimation algorithm. 



3 MODELING METHODOLOGY 

Let us first detail the procedure used to validate our model. Since the Gp predictor 
is an exact interpolator (except when a nugget effect is included), residuals of the 
learning data cannot be used directly. So, to estimate the mean squar ed error in a 



non-q ptimistic way, we use either a i^-fold cross validation procedure (jHastie et al. 



(|2002l )) or a test sample (consisting of new data, unused in the building process 
of the Gp model). In both cases, the predictivity coefficient Qi is computed. Q2 
corresponds to the classical coefficient of determination R 2 for a test sample, i.e. for 
prediction residuals: 

Q 2 (Y, Y) = 1 \-_ 

Yh=i (Y-Yi) 

where Y denotes the nt es t observations of the test set and Y is their empirical mean. 
Y represents the Gp model predicted values, i.e. the conditional mean (equation 
([!])) computed which the estimated values of parameters (/3, a, 6,p, r). Other simple 
validation criteria can be used: the absolute er ror, the mean and standard d eviation 



of the relative residuals, . . . (see, for example, iKleiinen and Sargentl (|2000l )). which 



are all global measures. Some statistical and graphical analyses of residuals can 
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provide more detailed diagnostics. 

Our methodology consists in seven successive steps. A formal algorithmic defini- 
tion is specified for each step. For i = l,...,d, let ej denote the i th input variable. 
Aio = . . . , denotes the complete initial model (i.e. all the inputs in their 

initial ranking). M.\ = |e^\ . . . , and M.2 = je[ 2 \ • • • , refer to the inputs 
in new rankings after sorting by different criteria (correlation coefficient or variation 
of Qq). Finally, Ai C ov and M r eg denote the current covariance model and the cur- 
rent regression model; i.e. the list of selected inputs appearing in the covariance and 
regression functions. 

Step - Standardization of input variables 

The appropriate procedure to construct a metamod el requires space fi lling designs 



with good optimality and orthogonality properties (jFang et al.l (|2006l )). However, 
we are not always able to choose the experimental design, especially in industrial 
studies when the data have been generated a long time ago. Furthermore, other 
restrictions can be imposed; for example, a sampling design taking into account 
the prior distribution of input variables. This can have prejudicial consequences for 
hyperparameter estimation and metamodel quality. 

So, to increase the robustness of our parameter estimation algorithm and to opti- 
mize the metamodel quality, we recommend to transform all the inputs into uniform 
variables. In order to get each transformed input variable following an uniform dis- 
tribution U[0, 1], the theoretical distribution (if known) or the empirical ones after a 
piecewise linear approximation is applied to the original inputs. This approximation 
is required to avoid transforming a future unsampled x* to one of the transformed 
training sites, even if no element of x* is equal to the corresponding element of any of 
the untransformed training sites. We empirically observed that this uniform transfor- 
mation of the inputs seems well adapted to correctly estimate correlation parameters. 
Choices of bounds and starting points are also simplified by this standardization. 

Step 1 - Initial input variables ranking by decreasing coefficient of 
correlation between and Y 

Sorting input variables is necessary to reduce the number of possible models, espe- 
cially to dissociate regression and covariance models. Furthermore, direct estimation 
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of all parameters without an efficient starting point gives bad results. So, as a sort 
criterion, we choose the coefficient of correlation between the input variable and the 
output variable under consideration. The correlation coefficients between the input 
parameters and the o utput variable are t he simplest measures of the influence of 
inputs on the output ([Saltern et al.l (|2000i )). They are valid in the linear relation 
context, while in the nonlinear context, they give a first idea of the hierarchy among 
input variables, in terms of their influence on the output. Finally, this simple and 
intuitive choice does not require any modeling and appears a good initial method to 
sort the inputs when no other information is available. 

For a strongly nonlinear computer code, it could be interesting to use a qualitative 
metho d, independent of th e model complexity, in order to sort the inputs by influence 
order ([Helton et al.l ([2006)). Another possibility would be to fit an initial Gp model 
with an intercept only regression part and all components of p equal to 1 or 2. Only 
the correlation coefficients vector 9 has to be e stimated. Then, sens itivity measures 



tad) 



such as the Sobol indices (partem et al 
and used to sort the inputs by influence order. 
Algorithm 



Volkova et al 



(120081 )) are computed 



M 



.-{^,...,^>}-A< 1 -{^,...,^} 



Mreg = Ml 

M cov = Mi 

Step 2 - Initialization of the correlation parameter bounds and starting 
points for the estimation procedure 

To constrain the ip optimization, the DACE estimation procedure requires three 
following values for each correlation parameter: a lower bound, an upper bound and 
a starting point. These values are crucial for the success of the estimation algorithm, 
when it is used directly for all the input variables. However, using sequential esti- 
mation based on progressive introduction of input variables, we limit the problems 
associated with these three values, especially with the starting point value. Another 
way to reduce the importance of starting point and bounds is to increase the number 
of iterations in DACE estimation algorithm. However, in the case of a high number 
of inputs, increasing the number of iterations in DACE can become extremely time 
expensive; a compromise has to be found. As the input variables have been previously 
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transformed into standardized uniform variables, the initialization and the bounds of 
the correlation parameters can be the same for all the inputs: 

O lower bounds for each component of and p: lobg = 1CP 8 , lobp = 0, 
O upper bounds for each component of 6 and p: upbg = 100 , upbp = 2, 
O starting points for estimation of each component of 6 and p: 8° = 0.5 , 
p° = 1. 

Step 3 - Successive inclusion of input variables in the covariance function 

For each set of inputs included in the covariance function, all the inputs from 
the ordered set in the regression function are evaluated. Correlation and regression 
parameters are estimated by the DACE modified algorithm, with the values, esti- 
mated at the (i — l) th step for the same regression model, used as a starting point. 
More precisely, at step i, input variables numbered from 1 to i are included in the 
covariance function and the algorithm estimates pairs of the correlation parameters 
(01, Pl) for I = 1, . . . ,i. As the starting point, the algorithm uses correlation parame- 
ters obtained at the (i — l) th step for the starting values of . . . , 
First starting value of (9i,Pi) is fixed to an arbitrary reference value. Then, at each 
ste p, selection of variab les in the regression part is also made. 



Hoeting et al.l (|2006l ) recommends the corrected Akaike information criterion (AICC) 
for input selection in the regression model in order to take spatial correlations into 
account. Therefore, after the estimation of correlation and regression parameters, 
the AICC is computed: 

mi + mi + 1 



AICC = -2l Ys [f3,0,a)+2n 



n — mi — m2 — 2 

where m\ denotes the number of input variables in the regression function, mi those 
in the covariance function and ly the log- likelihood of the sample Y. The required 
model is the one minimizing this criterion. 
Algorithm 
For * = 1 ... d 

O Step 3.1: Variables in covariance function 
Mi jC ov , = Mcov(l, ■ ■ ■ ,«) 
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O Step 3.2: Successive inclusion of input variables in regression function 
For j = 1 . . . d 

• Regression Model: 

M jy reg = M reg (l, . . . , j) 

• Parameter estimation: 

Qinit = (fl^i-lJJ . . . Ji-^-V* 
p init = . . . 

[e^,p^\ = DACE estiumtion(M iyCOV , M jyreg , [0 ini \p ini % [lob e ,lob p ], [upb e ,upb p ]) 

• AICC Criterion computation 
AICC(z, j) = klCC{M l , cov ,M h re 9 ) 

End 

O Step 3.3: Optimal regression model selection: 

j°P tim (i) = argmin (AICC(i, j)) 

j 

O Step 3.4: Q2 evaluation by fT-fold cross validation or on test data (with 
current correlation model and optimal regression model) 

Q2(«) = Q2(Mi,coV, Mjoptim^^g) 

End 

This order (correlation outer, regression inner) can be justified by minimizing the 
computer time required for optimization. The selection procedure for the regression 
part is made by the minimization of AICC criterion which requires, at each step, only 
one parameter estimation. On the other hand, the covariance selection is made by 
the maximization of Q2 which is often computed by a K-io\d cross validation. This 
procedure requires, at each step, K estimation procedures. So, the loop on covariance 
selection is the more expensive, and consequently has to be outer. The choice of K 
depends on the number of observations of the data set, on the constraints in term of 
computer time and on the influence of the learning sample size on prediction quality. 
If few data are available, a leave-one-out cross-validation could be preferred to a 
i^-fold procedure to avoid an undesirably negative effect of small learning sets on 
prediction quality. 

Remark 3.1 To avoid some biases on the choice of the optimal covariance model in 
the next two steps, the coefficient Q2 has to be computed on a test sample (or by a 
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cross validation procedure), different from the one used for the final validation of the 
Gp model at step 7. 



Other cri t eria o f ten used in the optim ization of the computer experiment designs 



( Sacks et al 



(|1989l ). ISantner et al.l (|2003l )) could be considered to select the optimal 
regression and covariance model. These criteria are based on the variance of Gp 
model: they produce a model that minimizes the maximum or the integral of predic- 
tive variance over input space. However, in the case of a high number of inputs, the 
optimization of these criteria can be very computer time expensive. The advantage 
of the Q2 statistic is its relatively fast evaluation, while producing a final model that 
optimizes the predictive performance. 

Step 4 (optional) - New input variables ranking in the covariance function 
based on the evolution of Q2 (inputs sorted by decreasing "jumps" of Q2) 

This optional step improves the selection of inputs, particularly in the covariance 
function. For each input Xj, the increase of the Q2 coefficient (denoted AQ2W) 1S 
computed when this i th variable is added to the covariance function. This value is an 
indicator of the contribution of the i th input to the accuracy of the Gp model. For 
this reason, it can be judicious to use values AQ2(1)> • • • > AQiid) to sort the inputs 
included in the correlation function. The inputs are sorted by decreasing of values 
AQ2W an d the procedure of parameter estimation is repeated with this new ranking 
of inputs for the covariance function (step 3 is rerun). 
Algorithm 

• Evaluation of Q2 increase for each input variable included in the covariance 
function: 

AQa(fc) = Q 2 (l) 
For k = 2 ... d 

AQ 2 (k) = Qa(fc) -Q 2 (k- 1) 

end 

• Sorting input variables by decreasing of AQ2 
Mi => M2 

Step 5 (optional) - Algorithm for parameter estimation with new ranking 
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of input variables in the covariance function 

This optional step improves the selection of inputs, particularly in the covariance 
function. The procedure of parameter estimation (step 3) is repeated with the inputs 
sorted by decreasing values of in the covariance function. Consequently, 

correlation parameters related to the inputs that are the most influential for the 
increase of the Gp model accuracy are estimated in the first place. Furthermore, we 
can also hope that the use of this new ranking allows a decrease in the number of 
inputs included in the covariance function and an optimal input selection. The use 
of this new ranking appears more judicious and justifiable for the covariance function 
than sorting by decreasing correlation coefficient (cf. step 1). However, the ranking 
of step 1 is kept for the regression function. 
Algorithm 

Mreg = Ml 

< 

k Mcov = M 2 
Step 6 - Optimal covariance model selection 

For each set of inputs in the covariance function, the optimal regression model 
is selected based on minimization of the AICC criterion (cf. step 3.3). Then, the 
predictivity coefficient Q2 is computed either by cross validation or on a test sample 
(cf. step 3.4). Finally, the selected covariance model is the one corresponding to the 
highest Q2 value. 
Algorithm 

i optim = argmax (Q 2 (*)) 

i 

MzT = M cov (l,...,i°P Um ) 

M°£™ = M reg (l, . . . joptim^optim^ 

Step 7 - Final validation of the optimal Gp model 

After building and selecting the optimal Gp model, a final validation is neces- 
sary to evaluate the predictive performance and to eventually compare it to other 
metamodels. To do this, coefficient Q2 is evaluated on a new test sample (i.e. data 
not used in the building procedure). If only few data are available, a cross validation 
procedure can be considered. So, two cross validation procedures are overlapped; one 
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for building the model and one for its validation. 
Algorithm 

After all the steps of our algorithm (including the step 5), we can often link the 
inputs appearing in the covariance and regression functions with the nature of their 
effects on the output. Indeed, we can generally observed 4 cases: the inputs with 
only a linear effect which are supposed to appear only in the regression and excluded 
from the covariance with the step 5, the inputs with only a non-linear effect which 
are excluded from the regression and can then appear in the covariance with the re- 
ordering of M.cov at step 5, the inputs with both effects appearing in the regression 
and covariance functions and, finally, the inactive input variables excluded from both. 



4 APPLICATIONS 

4.1 Analytical test case 

First, an analytical function called the g-function of Sobol is used to illustrate and 
justify our methodology. The g-function of Sobol is defined for d inputs uniformly 
distributed on [0, l] d : 

ffsoboiC^i, ...,X d ) = f\ g k (X k ) where g k (X k ) = ^ Xk 2 I + flfc anc } fflfc > q 

k=l 

Because of its complexity (strongly nonlinear and non-monotonic relationship) and 
the availability of analytical sensitivity indices, the g-function of Sobol is a well known 



test e xample in the studies of global sensitivity analysis algorithms (jSaltelli et al. 



(2000)). The contribution of each input X k to the variability of the model output is 



represented by the weighting coefficient a k . The lower this coefficient a k , the more 
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significant the variable X^. For example: 





= -» 


Xfc is very important, 


ak 




Xfc is relatively important 


ak 


= 9 -» 


Xfc is non important, 


k ak 


= 99 - 


-> Xfc is non significant. 


we 


choose 


a k = k. 



Applying our methodology to the g-function of Sobol, we illustrate its different 
steps, especially the importance of rerunning the estimation procedure after sorting 
the inputs by decreasing AQ2 (cf. steps 4 and 5). At the same time, compa risons are 



made with other reference software like, for example, the GEM-SA software (lO'Hagan 

I 



(|200fl ). freely available at http://www.ctcd.group.shef.ac.uk/gem.html). 



To do this, different dimensions of inputs are considered, from 4 to 20: d = 
4, 6, ... , 20. For each dimension d, we generate a learning sample formed by Nls = 
d x 10 simulatio ns of the g-fun c tion o f Sobol following the Latin Hypercube Sampling 



(LHS) method (IMcKay et al.1 (|1979l )). Using these learning data, two Gp models 
are built: one following our methodology and one using the GEM-SA software. For 
each method, the Q2 coefficient is computed on a test sample of Nts = 1000 points. 
For each dimension d, this procedure is repeated 50 times to obtain an average 
performance in terms of the prediction capabilities of each method (mean of Q2)- 
The standard deviation of Q2 is also a good indicator of the robustness of each 
method. 

For each dimension d, the mean and standard deviation of Q2 computed on the 
test sample using different methods are presented in Table [TJ Three methods are 
compared: the GEM-SA software, our methodology without steps 4 and 5, and our 
methodology with steps 4 and 5. 

For the values of d higher than 6, our methodology including double selection 
of inputs (with steps 4 and 5) clearly outperforms the others. More precisely, the 
pertinence of rerunning the estimation procedure after sorting the inputs by decreas- 
ing AQ2 is obvious. The prediction accuracy is much more robust (lower standard 
deviation of Q2)- 

The drawback of our methodology lies in the somewhat costly steps 4 and 5. 
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g-Sobol GEM-SA software Gp methodology Gp methodology 

simulations without steps 4 and 5 with steps 4 and 5 



d 


ft T 

N L s 


Q2 


sd 


Q2 


sd 


Q2 


sd 


4 


40 


0.82 


0.08 


0.60 


0.21 


0.86 


0.07 


6 


60 


0.67 


0.24 


0.59 


0.16 


r\ or 

0.85 


0.05 


8 


80 


0.66 


0.13 


0.61 


0.10 


0.85 


0.04 


10 


100 


0.59 


0.25 


0.63 


0.13 


0.83 


0.05 


12 


120 


0.57 


0.16 


0.61 


0.15 


0.84 


0.05 


14 


140 


0.60 


0.17 


0.61 


0.14 


0.83 


0.03 


16 


160 


0.62 


0.11 


0.67 


0.06 


0.86 


0.04 


18 


180 


0.66 


0.09 


0.67 


0.05 


0.84 


0.03 


20 


200 


0.64 


0.09 


0.72 


0.07 


0.86 


0.02 



Table 1: Mean Q 2 and standard deviation sd of the predictivity coefficient Q 2 for several 
implementations of the g-function of Sobol. 

Indeed, sequential estimation and rerunning of the procedure require many executions 
of the Hooke & Jeeves algorithm, particularly in the case of a double cross validation 
(cf. steps 3.4 and 7 of the algorithm). Consequently, this approach is much more 
computer time expensive than the GEM-SA software. For example, for a simulation 
with d = 10 and Nls = 100, the computing time of our approach is on average ten 
times larger than that of the GEM-SA software. 

For a practitioner, a compromise is usually made between the time to obtain the 
sampling design points and the time to build a metamodel. As a conclusion of this 
section, our methodology is interesting for high dimensional input models (more than 
ten), for inadequate or small sampling designs (a few hundreds) and when simpler 
methodologies have failed. The data presented in the next section fall into this scope. 

Remark 4-1 The Gp model used in the GEM-SA software has a gaussian covariance 
function. Our model uses a generalized exponential correlation function even if it 
requires the estimation of twice as many hyperparameters. Indeed, the sequential 
approach allows to estimate a large number of hyperparameters. 
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4.2 Application on an hydrogeologic transport code 

Our methodology is now applied to the data obtained from the modeling of strontium 
90 (noted 90 Sr) transport in saturated porous media using the MARTHE software 
(developed by BRGM, the French Geological Survey). The MARTHE computer code 
models flow and transport equations in three-dimensional porous formations. In the 
context of an environmental impact study, this code is used to model 90 Sr transport 



i n sat urated media for a radwaste temporary storage site in Russia (jVolkova et al 



(2008)). One of the final purposes is to determine the short-term evolution of 90 Sr 
transport in soils in order to help rehabilitation decision making. Only a partial 
characterization of the site has been made and, consequently, values of the model 
input parameters are not known precisely. One of the first goals is to identify the most 
influential parameters of the computer code in order to improve the characterization 
of th e site in an optimal way. Because of large computing time of the MARTHE 
code, IVolkova et al.1 (|2008l ) propose to construct a metamodel on the basis of the first 
learning sample. In the following, our Gp methodology is applied and its results are 
compared to the previous ones obtained with boosting regression trees and linear 
regression. 



4.2.1 Data presentation 

Data simulated in this study are composed of 300 observations. Each simulation 
consists of 20 inputs and 20 outputs. The 20 uncertain model parameters are perme- 
ability of different geological layers composing the simulated field (parameters 1 to 
7), longitudinal dispersivity coefficients (parameters 8 to 10), transverse dispersivity 
coefficients (parameters 11 to 13), sorption coefficients (parameters 14 to 16), poros- 
ity (parameter 17) and meteoric water infiltration intensities (parameters 18 to 20). 
To study sensitivity of the MARTHE code to these parameters, simulations of these 
20 parameters have been made by the LHS method. 

For each simulated set of parameters, MARTHE code computes transport equa- 
tions of 90 Sr and predicts the evolution of 90 Sr concentration. Initial and boundary 
conditions for the flow and transport models are fixed at the same values for all 
simulations. So, for an initial map of 90 Sr concentration in 2002 and a set of 20 
input parameter values, MARTHE code computes a map of predicted concentrations 
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in 2010. For each simulation, the 20 outputs considered are values of Sr concen- 
tration, predicted for year 2010, in 20 piezometers located on the waste repository 
site. 



4.2.2 Comparison of three different models 

For each output, we choose to compare and analyze the results of three models: 

> Linear regression: it represents a model that provides a reference for the 
contribution of the Gp model stochastic component to modeling qual- 
ity. Indeed, comparison between simple linear regression and Gp model 
will show if considering spatial correlations has significant impact on the 
modeling results. Moreover, a selection based on the AICC criterion is 
implemented to optimize the results of the linear regression. 

> Boosting of regression tree s : this model was used in the previous study 



of the data (jVolkova et al.1 (120081 )). The boosting trees method is based 
on sequential construction of weak models (here regression trees with low 
interaction depth), that are then aggregated. The MART a l gorith m (Mul- 



tiple Additive Regression Trees), described in lHastie et al 



(2002 , is used 



here. The boosting trees method is relatively complex, in the sense that, 
as with neural networks, it is a black box model, efficient but quite diffi- 
cult to interpret. It is interesting to see if a Gp model, that is easier to 
interpret and offers a quickly computable predictor, can compete with a 
more complex method in terms of modeling and prediction quality. Note 
that the boosting trees algorithm also makes its proper input selection. 
> Gaussian process: to implement this model, the methodology previously 
described in this paper is applied, with the input selection procedure. 



4.2.3 Results 

To compare prediction quality of the three different models presented above, the 
coefficient of predictivity Q2 is estimated by a 6-fold cross validation. Note that 
for each model the results correspond to the optimal set of inputs included in the 
model. To avoid some bias in the results, the cross validation used to select variables 
in the Gp model (see step 6) differs from the cross validation used to validate and 
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compare prediction capabilities of the three models. Indeed, at each cross validation 
step (used to validate), data are divided into a learning sample (denoted LSI) of 
250 observations and a test sample (TS1) of the 50 remaining observations. For the 
Gp model, the procedure of variable selection is then performed by a second cross 
validation on LSI (for example: a 4-fold cross validation, dividing LSI into a learning 
set LS2 of 210 data and a test set TS2 of the 40 others). Then, an optimal set of 
variables is determined and a Gp model is built based on the 250 data of LSI (with 
this optimal set of inputs previously selected) . Finally, the model is validated on the 
test set TS1 that has never been used for the Gp model construction. 

The results are presented in Table [2] and are taken up in a barplot (see Figure [2]) ■ 
Results obtained for the output 8 (piezometer pi 10) are not considered because of 
physically insignificant concentration values. For most outputs, the Gp performance 
is superior to linear regression and boosting, in many cases substantially so. Con- 
cerning the outputs 11 (p27k) and 19 (p4a), the performances of the Gp model are 
worse than the linear regression ones. However, for these two outputs, the prediction 
errors are very high and consequently the difference of performance between the two 
models can be considered as non-significant. 

As expected, for most of the outputs, the linear regression presents the worst 
results. When this model is successfully adapted, the two others are also efficient. 
When linear regression fails (for example, for output number 12), Gp model presents 
a real interest, since it gives results as good as those of the boosting trees method. In 
fact, this is verified for all the ouputs and results are significantly better for several 
outputs (outputs 1, 2, 4, 9, 12, 13 and 16). To illustrate this, the Figure [3] shows 
the predicted values vs real values for the output 16, for the Gaussian process and 
boosting trees models. It clearly shows a better repartition of the Gp model residuals 
than the boosting trees model ones. 

Furthermore, the estimator of MSE, that is expressed analytically (see Equation 
([6])), can be used as a local prediction interval. To illustrate this, we consider 50 
observations of the output 16. Figure 0] shows the observed values, the predicted 
values and the upper and lower bounds of the 95% prediction interval based on the 
MSE local estimator. It confirms the good adequacy of the Gp model for this output 
because all the observed values (except one point) are inside the prediction interval 
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Table 2: Predictivity coefficients Q2 for the three different models of the MARTHE data. 



curves. 

4.2.4 Analysis 

These results confirm the potential of the Gp model and justify its application for 
computer codes. Application of our methodology to complex data also confirms the 
efficiency of our input selection procedure. For a fixed set of inputs in the covariance 
function, we can verify that this procedure selects the best set of inputs in regression 
part. Furthermore, the necessity of conducting sequential and ordered procedure 
estimation has been demonstrated. Indeed, if all the Gp parameters (i.e. considering 
the 20 inputs) are directly and simultaneously estimated with the DACE algorithm, 
they are not correctly determined and poor results in terms of Q2 are obtained. So, 
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Figure 2: Barplot of the predictivity coefficient Q2 for the three different models. 




Figure 3: Plot of predicted values vs real values for boosting trees (left) and Gaussian 
process (right). 



in case of a complex model with a large number of inputs, we recommend using a 
selection procedure such as the algorithm of section [3l 

The study of these data have motivated the choice of this methodology. At first, 
Welch's algorithm (see section I2.3|) has been tried. Considering the poor results 
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Figure 4: Plot of observed and Gaussian process predicted values for the output 16 with 
the 95% prediction interval based on MSE formula. 



obtained, our methodology based on the DACE estimation algorithm has been de- 
veloped. To illustrate this, let us detail the different results obtained on the output 
number 9. With our methodology based on the DACE estimation, the Q2 coefficient 
(always computed by a 6-fold cross validation) is 0.86, while with Welch's algorithm 
(used in its basic version), Q2 is close to zero. The difference in the results between 
the two methods can be explained by the value of estimated correlation parameters 
which are significantly different. 

To minimize the number of correlation parameters and consequently reduce com- 
puter time required for estimation, the possible values of power parameters pi (i = 
l,...,d) can be limited to 0.5, 1 and 2. It can be a solution to optimize computer 
time. It allows an exhaustive, quick and optimal representation of different kinds 
of correlation functions (two kinds of inflexion are represented). Furthermore, in 
many cases, estimation of power parameter with generalized exponential correlation 
converges to exponential (pi = 1) or Gaussian {pi = 2) correlation. 



25 



5 CONCLUSION 



The Gaussian process model presents some real advantages compared to other meta- 
models: exact interpolation property, simple analytical formulations of the predic- 
tor, availability of the mean squared error of the predictions and the proved ef- 



ficiency of the model. The keen int erest in this method i s testified 
lication of the recent monograp hs of 
Rasmussen and Williams! (|2006l ). 



Santner et al. 



(|2003h . iFang et al 



a y the pub- 
(|2006l ) and 



However, for its application to complex industrial problems, developing a robust 
implementation methodology is required. In this paper, we have outlined some dif- 
ficulties arising from the parameter estimation procedure (instability, high number 
of parameters) and the necessity of a progressive model construction. Moreover, an 
a priori choice of regression function and, more important, of covariance function is 
essential to parameterize the Gaussian process model. The generalized exponential 
covariance function appears in our experience as a judicious and recommended choice. 
However, this covariance function requires the estimation of 2d correlation parame- 
ters, where d is the input space dimension. In this case, the sequential estimation and 
selection procedures of our methodology are more appropriate. This methodology is 
interesting when the computer model is rather complex (non linearities, threshold 
effects, etc.), with high dimensional inputs {d > 10) and for small size samples (a few 
hundreds) . 

Results obtained on the MARTHE computer code are very encouraging and place 
the Gaussian process as a good and judicious alternative to efficient but non-explicit 
and complex methods such as boosting trees or neural networks. It has the advantage 
of being easily evaluated on a new parameter set, independently of the metamodel 
complexity. Moreover, several statistical tools are available because of the analytical 
formulation of the Gaussian model. For example, the MSE estimator offers a good 
indicator of the model's local accuracy. In the same way, inference studies can be 
developed on parameter estimators and on the choice of the experimental input de- 
sign. Finally, one possible improvement in our construction algorithm is based on the 
sequenti al approach of the choice of input des ign, which remains an active research 
domain (jScheidt and Zabalza-Mezghanil (|2004l ) for example). 
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